On Romanization for Model Transfer Between Scripts in Neural Machine Translation
Transfer learning is a popular strategy to improve the quality of low-resource machine translation. For an optimal transfer of the embedding layer, the child and parent model should share a substantial part of the vocabulary. This is not the case when transferring to languages with a different script. We explore the benefit of romanization in this scenario. Our results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but can improve the transfer between related languages with different scripts. We compare two romanization tools and find that they exhibit different degrees of information loss, which affects translation quality. Finally, we extend romanization to the target side, showing that this can be a successful strategy when coupled with a simple deromanization model.
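The core idea above — that romanization can increase surface vocabulary overlap between a parent and child model with different scripts — can be sketched as follows. The tiny Cyrillic-to-Latin table is a toy stand-in, not one of the romanization tools the paper compares; real tools also differ in how much information they lose (which is why the paper needs a deromanization model for the target side).

```python
# Sketch: romanization raises vocabulary overlap between a Latin-script
# parent vocabulary and a Cyrillic-script child vocabulary.
# CYR2LAT is a toy illustrative mapping, not a real romanization tool.
CYR2LAT = {"м": "m", "и": "i", "р": "r", "д": "d",
           "о": "o", "к": "k", "т": "t"}

def romanize(word):
    return "".join(CYR2LAT.get(ch, ch) for ch in word.lower())

def overlap(a, b):
    # Jaccard overlap between two vocabularies
    return len(a & b) / len(a | b)

parent_vocab = {"mir", "dom", "kot"}   # Latin-script parent
child_vocab = {"мир", "дом", "кот"}    # Cyrillic-script child

print(overlap(parent_vocab, child_vocab))                          # 0.0
print(overlap(parent_vocab, {romanize(w) for w in child_vocab}))   # 1.0
```

With no shared surface forms, the embedding layer transfers nothing; after romanization, every child token matches a parent token — at the cost of collapsing distinctions the original script made.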
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en-de and de-en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data, and we release our code and data to facilitate further experiments.
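Sample-based MBR decoding, the probe used above, picks the candidate with the highest average utility against all other samples; inspecting which candidates win (or score suspiciously well) is what exposes metric weaknesses. A minimal sketch, using a toy token-F1 utility where the paper uses COMET:

```python
from collections import Counter

def mbr_decode(samples, utility):
    """Sample-based MBR: return the sample with the highest average
    utility against all other samples (the consensus candidate)."""
    def expected_utility(i):
        return sum(utility(samples[i], samples[j])
                   for j in range(len(samples)) if j != i) / (len(samples) - 1)
    return samples[max(range(len(samples)), key=expected_utility)]

def token_f1(hyp, ref):
    # toy utility function; the paper uses COMET, a learned neural metric
    h, r = Counter(hyp.split()), Counter(ref.split())
    tp = sum((h & r).values())
    if tp == 0:
        return 0.0
    p, rec = tp / sum(h.values()), tp / sum(r.values())
    return 2 * p * rec / (p + rec)

samples = ["the cat sat", "the cat sat down", "a dog ran", "the cat sat"]
print(mbr_decode(samples, token_f1))  # "the cat sat"
```

If a metric under-penalises, say, a wrong number, MBR with that metric as utility will happily select the numerically wrong candidate — which is exactly the failure mode the paper quantifies for COMET.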
ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics, including the submissions to the WMT 2022 metrics shared task, and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference, and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
Comment: preprint for WMT 202
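Challenge-set evaluation of a metric reduces to a contrastive check: given a good translation and a version with a deliberate accuracy error, does the metric rank the good one higher? A minimal sketch — the example format and the surface-overlap metric below are hypothetical stand-ins for ACES examples and real WMT submissions:

```python
def challenge_set_accuracy(examples, metric):
    """Fraction of examples where the metric scores the good translation
    above the incorrect one (a Kendall-tau-style accuracy)."""
    wins = sum(
        metric(ex["good"], ex["ref"]) > metric(ex["bad"], ex["ref"])
        for ex in examples
    )
    return wins / len(examples)

def unigram_overlap(hyp, ref):
    # toy surface-overlap metric, stand-in for a real MT metric
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / len(h | r)

# hypothetical examples in the spirit of ACES accuracy-error phenomena
examples = [
    {"ref": "he paid 50 euros", "good": "he paid 50 euros",
     "bad": "he paid 40 euros"},
    {"ref": "she arrived late", "good": "she arrived late",
     "bad": "she arrived"},
]
print(challenge_set_accuracy(examples, unigram_overlap))  # 1.0
```

A surface-overlap metric passes these easy cases precisely because the error changes the surface string; the harder ACES phenomena (discourse, real-world knowledge) are where such metrics break down, motivating recommendation b) above.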
ACES: Translation Accuracy Challenge Sets at WMT 2023
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs. The phenomena range from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison. We also measure the incremental performance of the metrics submitted to both WMT 2023 and 2022. We find that 1) there is no clear winner among the metrics submitted to WMT 2023, and 2) performance change between the 2023 and 2022 versions of the metrics is highly variable. Our recommendations are similar to those from WMT 2022. Metric developers should focus on: building ensembles of metrics from different design families, developing metrics that pay more attention to the source and rely less on surface-level overlap, and carefully determining the influence of multilingual embeddings on MT evaluation.
Comment: Camera Ready WMT 2023
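The per-category performance profile mentioned above can be sketched as a grouping of contrastive wins by phenomenon category. Note this is only the profiling step: the real ACES-Score combines category-level results with specific weights defined in the paper, which this sketch does not reproduce; the toy metric and example format are hypothetical.

```python
from collections import defaultdict

def category_profile(examples, metric):
    """Per-category accuracy: how often the metric ranks the good
    translation above the incorrect one within each phenomenon category."""
    wins, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["category"]] += 1
        wins[ex["category"]] += (
            metric(ex["good"], ex["ref"]) > metric(ex["bad"], ex["ref"])
        )
    return {c: wins[c] / totals[c] for c in totals}

def char_overlap(hyp, ref):
    # toy character-overlap metric, stand-in for a WMT submission
    h, r = set(hyp), set(ref)
    return len(h & r) / len(h | r)

examples = [
    {"category": "numbers", "ref": "it costs 50 euros",
     "good": "it costs 50 euros", "bad": "it costs 40 euros"},
    {"category": "omission", "ref": "she arrived very late",
     "good": "she arrived very late", "bad": "she arrived"},
]
print(category_profile(examples, char_overlap))
```

Such a profile is what makes the 2022-vs-2023 comparison in the paper possible: an overall score can stay flat while individual categories move in opposite directions.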
Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data
Most research on hate speech detection has focused on English, where a sizeable amount of labeled training data is available. However, to expand hate speech detection into more languages, approaches that require minimal training data are needed. In this paper, we test whether natural language inference (NLI) models, which perform well in zero- and few-shot settings, can benefit hate speech detection performance in scenarios where only a limited amount of labeled data is available in the target language. Our evaluation on five languages demonstrates large performance improvements of NLI fine-tuning over direct fine-tuning in the target language. However, the effectiveness of previous work that proposed intermediate fine-tuning on English data is hard to match. Only in settings where the English training data does not match the test domain can our customised NLI formulation outperform intermediate fine-tuning on English. Based on our extensive experiments, we propose a set of recommendations for hate speech detection in languages where minimal labeled training data is available.
Comment: 15 pages, 7 figures, Accepted at the 7th Workshop on Online Abuse and Harms (WOAH), ACL 202
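The NLI formulation above works by recasting classification as entailment: the input text becomes the premise and each label is verbalised as a hypothesis, so a single multilingual NLI model can score any label in any language. A minimal sketch — the hypothesis wording is hypothetical (not the paper's exact template), and `toy_nli` is a keyword-lookup stand-in for a real NLI model:

```python
# Verbalised label hypotheses (hypothetical wording, for illustration)
HYPOTHESES = {
    "hate":     "This text expresses hate speech.",
    "not_hate": "This text does not express hate speech.",
}

def classify(text, entailment_prob):
    """Pick the label whose hypothesis the NLI model finds most entailed."""
    scores = {label: entailment_prob(premise=text, hypothesis=h)
              for label, h in HYPOTHESES.items()}
    return max(scores, key=scores.get)

def toy_nli(premise, hypothesis):
    # stand-in for a real NLI model: a tiny keyword lexicon
    hateful = any(w in premise.lower() for w in ("hate", "vermin"))
    wants_hate = "not" not in hypothesis
    return 0.9 if hateful == wants_hate else 0.1

print(classify("I hate those people", toy_nli))   # hate
print(classify("What a lovely day", toy_nli))     # not_hate
```

Because only the hypotheses are label-specific, adapting to a new language or label set means rewording a few sentences rather than collecting a new labeled corpus — which is why the approach suits the limited-data settings studied in the paper.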
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multihead attention, however,
we see isolated improvements when only a subset of heads is biased towards
monotonic behavior.Comment: To be published in: Proceedings of the 2021 Conference of the North
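A monotonicity loss over standard attention weights can be sketched as follows: compute the expected source position attended to at each target step and penalise any decrease between consecutive steps. This illustrates the general idea of such a loss, not the paper's exact formulation:

```python
import numpy as np

def monotonicity_loss(attn):
    """attn: (target_len, source_len) attention matrix, rows sum to 1.
    Penalises decreases in the expected source position across target
    steps; zero iff attention drifts monotonically left-to-right."""
    positions = attn @ np.arange(attn.shape[1])  # expected source index per step
    return np.maximum(0.0, positions[:-1] - positions[1:]).mean()

monotone = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
reversed_ = monotone[::-1]  # attends right-to-left

print(monotonicity_loss(monotone))   # 0.0
print(monotonicity_loss(reversed_))  # 1.0
```

Because the penalty is a differentiable function of ordinary attention weights, it can simply be added to the training objective — no specialized attention function is required, which is the compatibility property the abstract highlights. Biasing only a subset of transformer heads amounts to applying the loss to those heads' attention matrices alone.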
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (NAACL-HLT 2021
Building a Parallel Corpus on the World's Oldest Banking Magazine
We report on our processing steps to build a diachronic parallel corpus based on the world's oldest banking magazine. The magazine has been published since 1895 in German, with translations in French and partly in English and Italian. Our data sources are printed issues (until 1997), PDF issues (since 1998) and HTML files (since 2001). The corpus building poses special challenges in article boundary recognition and cross-language article and sentence alignment. Our corpus fills a gap in parallel corpora with respect to genre (magazine articles), domain (banking and economy articles), and its time span (120 years).
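Cross-language sentence alignment, one of the challenges named above, classically relies on the observation that translated sentences have similar lengths. The greedy 1-1 sketch below only illustrates that length cue; production pipelines (and presumably the one behind this corpus) use dynamic-programming aligners in the Gale-Church tradition, and the threshold value here is an arbitrary assumption:

```python
def align_by_length(src_sents, tgt_sents, max_ratio=1.6):
    """Toy 1-1 aligner: pair sentences in order and keep only pairs whose
    character-length ratio is plausible for a translation pair.
    max_ratio=1.6 is an illustrative threshold, not an empirical value."""
    pairs = []
    for s, t in zip(src_sents, tgt_sents):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        if ratio <= max_ratio:
            pairs.append((s, t))
    return pairs

src = ["Die Bank wurde 1895 gegründet.", "Kurz."]
tgt = ["La banque a été fondée en 1895.",
       "Une phrase beaucoup plus longue que l'originale."]
print(align_by_length(src, tgt))  # keeps only the first, length-compatible pair
```

The second pair is rejected because a five-character sentence is an implausible translation of a forty-eight-character one; a DP aligner would instead consider 1-2 or 0-1 alignments at that point, which matters for OCR'd historical issues where sentence boundaries are noisy.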